The allotetraploid horseradish genome provides insights into subgenome diversification and formation of critical traits

Polyploidization can provide a wealth of genetic variation for adaptive evolution and speciation, but understanding the mechanisms of subgenome evolution as well as its dynamics and ultimate consequences remains elusive. Here, we report the telomere-to-telomere (T2T) gap-free reference genome of allotetraploid horseradish (Armoracia rusticana) sequenced using a comprehensive strategy. The (epi)genomic architecture and 3D chromatin structure of the A and B subgenomes differ significantly, suggesting that both the dynamics of the dominant long terminal repeat retrotransposons and DNA methylation have played critical roles in subgenome diversification. Investigation of the genetic basis of biosynthesis of glucosinolates (GSLs) and horseradish peroxidases reveals both the important role of polyploidization and subgenome differentiation in shaping the key traits. Continuous duplication and divergence of essential genes of GSL biosynthesis (e.g., FMOGS-OX, IGMT, and GH1 gene family) contribute to the broad GSL profile in horseradish. Overall, the T2T assembly of the allotetraploid horseradish genome expands our understanding of polyploid genome evolution and provides a fundamental genetic resource for breeding and genetic improvement of horseradish.


Supplementary Note 1. Genome assembly using Oxford Nanopore Technology (ONT) and Hi-C data.
We conducted the genome assembly using the comprehensive and customized pipeline. First, the ONT long reads were assembled into contigs using NextDenovo software (v2.0, https://github.com/Nextomics/NextDenovo) with default parameters. The single-base accuracy was improved using Illumina reads and Pilon (v1.24) software 1 . Based on the genome survey using Illumina reads, the estimated genome size was approximately 636 million (M)bp. Initially, we assembled the long reads into 292 contigs with an N50 of 7.95 Mbp, which totaled 611 Mbp and accounted for 96% of the estimated genome size (Supplementary Table 2). Second, we employed a scaffolding approach using HiC-seq data, taking into account the possibility of chimeric assemblies caused by homoeologous contigs from the two horseradish subgenomes ( Supplementary Fig. 2). To improve assembly continuity and avoid missing sequences, we implemented the following steps. First, we performed genome annotation using the MAKER-P (v2.29+) pipeline on the polished contigs, resulting in the generation of a gene set (see the detailed description of genome annotation in the Methods part) 2 . Second, we analyzed pairwise synteny and obtained collinear segments using the JCVI (v1.2.20) software 3 . The pairwise Ks values of the homoeologous genes were calculated using PAML (v4.9j) 4 . Third, based on the Ks values, we predicted synteny blocks derived from the two different subgenomes. HiC-seq reads were aligned to contigs using Juicer (v1.6) 5 . Paired read links between the predicted homoeologous contigs from the two subgenomes were removed. Finally, we used the 3D de novo assembly (3D-DNA) 5 pipeline (v201008) for scaffolding.
We partitioned and phased the subgenomes based on repetitive k-mers and orthologous genes using SubPhaser (v1.2) 6 . To avoid the effects of large-scale genomic deletions on clustering, we used only chromosomal regions within the syntenic region for analysis. We ran the SubPhaser with different parameters ('-k 13 -q 100 -f 2', -k 15 -q 100 -f 2', '-k 15 -q 200 -f 2', '-k 13 -q 200 -f 2'). In addition, the orthologous genes were also examined. We adopted the branch length of the phylogenetic tree constructed based on the orthologous genes. The Arabidopsis thaliana genome was used to determine the synteny block and orthologous gene pairs with each scaffold using JCVI software 3 . Orthologous gene pairs were subjected to multiple sequence alignment using MAFFT (v7.490) software with default parameters 7 and concatenated within each scaffold. Phylogenetic trees were constructed based on the concatenated sequences using FASTTREE (v2.1.11) with default parameters 8 .
Multiple methods were used to assess the quality of the genome assembly. The read re-mapping ratio was assessed using Illumina reads and ONT long reads. Illumina short reads were aligned to the genome using BWA MEM (v0.7.17 ) software with default parameters 9 , while RNA reads were aligned using Hisat2 (v2.2.1) software with default parameters 10 . Minimap2 (v2.24) 11 software with parameters "-a -x map-ont -t 30" was used to map the long reads. The integrity of the genetic region was assessed using BUSCO (v4.0) software 12 . In addition, the quality of repeat sequences was assessed using the LTR Assembly Index (LAI) 13 .
Karyotypes of Brassicaceae species were determined by comparative chromosome painting (CCP) by Lysak and his colleagues [14][15][16] . The concept of Ancestral Crucifer Karyotype (ACK; n = 8) containing 22 genomic blocks (GBs, named A to X) was proposed 17 . To analyze the horseradish karyotype, we extracted the GBs from the A. thaliana genome and performed collinearity analysis between the genomes. We observed that nearly all ancestral GBs could be traced in the horseradish genome. However, there were also differences between the assemblies of the two subgenomes ( Supplementary Fig. 6), such as losses of large genomic segments (blocks F and R). We constructed the horseradish karyotype based on the ancestral GBs and compared it with the karyotype generated by CCP 18 . We observed six ancestral chromosomes (AK1, AK2, AK3, AK4, AK5, and AK7) in both subgenomes ( Supplementary Fig. 6). Two chromosomes (a/b06 and a/b08) originated through a reciprocal translocation involving ancestral chromosomes AK6 and AK8, in accordance with the previously published cytogenomic map 18 (Supplementary Fig. 23). The partial losses of GB F on chromosome a03 and GB R on chromosome b08 have not been detected in the comparative cytogenomic map (Supplementary Fig. 6) 18 . To elucidate this discrepancy, we performed CCP validation of the two genomic blocks. As two genomic copies of both GBs were identified by CCP ( Supplementary Fig. 7), we concluded that the purported deletions were due to an incomplete genome assembly ( Supplementary Fig. 7).

Supplementary Note 2. Genome assembly using PacBio HiFi and Hi-C data.
To further improve the genome assembly, we performed genome sequencing and generated PacBio HiFi long reads 19 . The tender leaves of horseradish were harvested and used for DNA extraction and sequencing library construction. The SMRTbell libraries were prepared according to the standard protocol of PacBio (Pacific Biosciences) and sequenced on the PacBio Sequel II platform to generate HiFi reads 20,21 . Finally, we obtained 27.71 Gbp of consensus reads after processing (Supplementary Table 1). The contigs from HiFi reads were assembled using HiFiasm (v0.17.7) software with default parameters 22 . Ultimately, we obtained a HiFi-based genome assembly with a genome size of 612 Mbp and an N50 length of 34 Mbp (Supplementary Table 5). We aligned the contigs to the ONT-based genome assembly using minimap2 11 and generated the dot plot using the dotPlotly software (https://github.com/tpoorten/dotPlotly). The dot plot showed that the missing genomic blocks in the ONT-based genome assembly were present in the HiFi-based genome assembly ( Supplementary Fig. 8), indicating improved integrity and continuity of the HiFi assembly. Taking advantage of the accuracy of the long HiFi reads and the high continuity of the assembly, we directly performed scaffolding. This resulted in a genome assembly comprising 16 pseudochromosomes that anchored 97.29% of the assembled sequence, leaving only 8 gaps (Supplementary Table 6). Notably, nine of the 16 pseudochromosomes consisted of a single contig, highlighting the advantages of high-accuracy HiFi reads in resolving complex polyploid genomes (Supplementary Table 6).
Considering the improved continuity and integrity of the HiFi-assembled genome, we selected it as the genome backbone and filled the remaining gaps using the ONT-based genome with TGSgapcloser (v1.2.1) software 23 . After closing all gaps, we obtained a gap-free reference genome consisting of 16 chromosomes with the total length of 610 Mbp. By searching for the seven-base telomere repeat sequence (CCCTAAA), we identified 31 telomeres (15 pairs plus one singleton) and constructed 15 T2T pseudomolecules (Supplementary Table 7). The quality of the genome assembly was assessed using methods described earlier (see Supplementary Note 1).

Supplementary Note 3. Identification of intact (retro)transposons.
Due to the high abundance of LTR retrotransposons in the horseradish genome, we investigated the dynamics of LTR retrotransposons during genome evolution. First, we used the LTR-retriever software (v2.9.0) 24 to integrate the outputs from LTRharvest (v2.9.0) 25 and LTR_FINDER (v1.07) 26 using default parameters. A total of 16,977 intact (retro)transposons were identified in the horseradish genome, consisting of 5,695 DNA transposons and 11,282 LTR-RTs (Supplementary Table 16). The distribution of intact (retro)transposons was similar in both subgenomes (Supplementary Data 2). Analysis of the insertion timing of LTR-RTs across the genome showed that recent bursts of different LTR-RT types occurred ~0.2 million years ago (Fig. 2c). In addition, we observed variation in burst timing between different LTR-RT types (Fig. 2c), possibly due to differences in transpositional activity and/or epigenetic status of the two LTR-RT superfamilies.

Supplementary Note 4. Clustering of intact LTR-RTs and characterization of LTR-RT families.
To further characterize the LTR-RTs family and classify the clades among different LTR-RT families, we developed an in-house pipeline volcano (https://github.com/maypoleflyn/valcano). In brief, accurate identification of LTR-RTs was conducted using LTR-retriever 24 . The LTR-RTs were classified based on the identity of LTR sequences using CD-HIT-EST (v4.8.1) 27 with parameters 'c 0.8 -aL 0.8 -T 0 -M 0 -n 5 -d 200'. RepeatMasker (v4.1.2) software was used with the clustered LTR-RT family sequences as a library to determine the copy number and coverage of each family in the genome. To determine the phylogenetic relationship of the identified LTR-RTs, the reverse transcription domains of different types of LTR-RTs were used as queries to search and obtain reverse transcription domain sequences for each LTR-RT element using tBLAST 28 . The obtained amino acid sequences of reverse transcription domains were subjected to multiple alignments using MAFFT (v.7.0) 7 with default parameters. Phylogenetic trees were inferred using FASTTREE (v.2.1) 8 with default parameters and visualized using ITOL (https://itol.embl.de/). Expression of transposable elements was estimated using Telescope software (v1.0.3) 29 . RNA-Seq reads were aligned to the genome assembly using HISAT2 software 10 . The annotation GTF file was obtained from the LTR-retriever.
We used RepeatMasker with a custom library created from clustered LTR-RT family sequences to assess the coverage of each LTR-RT family in the genome. Our analysis revealed that the repetitive nature of the genome is primarily influenced by specific LTR-RT families exhibiting exceptionally high genome coverages (Supplementary Data 3, Supplementary Fig. 15). Notably, the 150 LTR-RT families with the highest genome coverage accounted for nearly 50% of the total LTR-RT genome coverage (Supplementary Data 3). Among these families, FAM1 classified as 'unknown' LTR-RTs, had the highest coverage (~2.88%) (Supplementary Data 3). All clustered sequences of LTR-RT families can be found on github (https://github.com/maypoleflyn/HG).

Supplementary Note 5. The influence of the intact LTR-RTs on nearby genes.
We performed further studies to examine the influence of intact LTR-RTs on nearby genes. A total of approximately 1,497 genes were found in close proximity (<1000 bp) to intact Gypsy/Copia LTR-RTs ( Supplementary Fig. 16a, b). To assess the expression levels near different types of LTR-RTs, we calculated the average expression value in the different tissues (root, stem, and leaf) using the deeptools software with a sliding window size of 50 bp. Our results showed that LTR-RTs, particularly Gypsy LTR-RTs, had a negative impact on the expression levels of nearby genes ( Supplementary Fig. 16a, b). Consequently, we observed increased methylation levels in the vicinity of LTR-RTs, with Gypsy LTR-RTs having significantly higher methylation levels compared to the Copia type ( Supplementary Fig. 16e, f). In addition, we quantified the expression of different retrotransposons and found a higher proportion of individuals with high expression levels among Copia LTR-RTs ( Supplementary Fig. 16g). KEGG enrichment analysis of genes located near different types of LTR-RTs (within 1,000 bp) indicated their association with crucial metabolic pathways/BRITE hierarchies such as glutathione metabolism, starch and sucrose metabolism, biosynthesis of various plant secondary metabolites, fatty acid degradation, and lipid metabolism ( Supplementary Fig. 16h, i). Based on these findings, we propose that LTR-RTs may exert a negative influence on nearby gene expression by increasing methylation levels, thereby affecting certain biological processes.

Supplementary Note 6. The influence of fragmented LTR-RTs on DNA methylation.
Fragmented LTR-RTs occupied a dominant proportion in terms of their abundance in the genome compared to intact and active LTR-RTs. Based on the observation of increased methylation levels in the vicinity of intact LTR-RTs, we extended our investigation to the methylation levels in the vicinity of fragmented LTR-RTs. Using the sequences of full-length LTR-RTs as a library, we ran RepeatMask software to identify and annotate fragmented LTR-RTs. The fragmented LTR-RTs were classified into three groups (group 1: <30%; group 2: >=30% and <=60%; group 3: >60%) based on coverage (the length of fragmented LTR-RTs/the length of corresponding intact LRT-RTs). Our observations revealed significantly increased methylation levels in different types/groups of fragmented LTR-RTs. Among flanking sequences, group 1 had the lowest methylation level, while higher methylation levels were observed in groups 2 and 3 ( Supplementary Fig. 32). We then examined only the fragments with coverage greater than 50% and divided these fragments into three classifications based on sequence identity (Classification I: >70% and <80%; Classification II >=80% and <=90%; Classification III: >90% and <100%). The methylation level of the fragmented LTR-RTs also increased, with sequences highly similar to intact LTR-RTs showing higher methylation levels ( Supplementary Fig. 33). Based on these observations, it is suggested that the high methylation levels in the LTR-RTs persist even in the absence of their activity.

Supplementary Note 7. The influence of fragmented LTR-RTs on nearby genes.
To investigate the impact of fragmented LTR-RTs on genes, the occurrence of high-confidence fragmented LTR-RTs longer than 200 bp was calculated in different genic regions, including exons, introns, and upstream regions (-500 bp). The analysis revealed that fragmented LTR-RTs were present in genic regions of 9,336 genes. Specifically, they were found in the exons of 1,769 genes, in the introns of 2,383 genes, and in the upstream regions within 500 bp of 6,951 genes ( Supplementary Fig. 34). These results suggest that a large number of the affected genes may be affected by alterations in gene structure or expression due to the presence of fragmented LTR-RTs. Examination of syntelogs between the two subgenomes identified a total of 3,683 genes that harbored fragmented LTR-RTs specific to the A-subgenome, and 3,676 genes that contained fragmented LTR-RTs specific to the B-subgenome ( Supplementary Fig. 34). Thus, fragmented LTR-RTs play a crucial role in subgenome diversification.
Analysis of fragmented LTR-RTs in the formation of differentially methylated syntelog pairs between subgenomes revealed that approximately 24.04% (or 24.70%) of syntelog pairs with a significantly higher methylation level in the A (or B) subgenome contained fragmented LTR-RTs exclusively in the A (or B) subgenome ( Supplementary Fig. 34). This suggests a possible link between the presence of fragmented LTR-RTs and the observed different methylation patterns in the subgenomes, suggesting their possible role in shaping subgenome-specific methylation profiles.

Supplementary Note 8. Analysis of centromeric regions.
Plant centromeres are usually composed of repetitive sequences, including LTR-RTs and tandem repeats 31 . In the A. thaliana genome, centromeres contain megabase-long islands of 178-bp tandem satellite DNA repeats (CEN180) and Athila retrotransposons [31][32][33] . The Hi-C interaction heatmap surrounding centromeric regions may have a lack signals, commonly referred to as "blank regions", due to the high density of repeats. Sequencing of repetitive regions with high GC content remains challenging using the Illumina platform 34,35 . In addition, aligning sequence reads of high repetitive content is difficult 34,35 . To predict centromeric regions in the horseradish genome, we integrated the Hi-C interaction heat map, the density distribution of satellite DNA repeats, and the density distribution of LTR-RTs. The LTR-RTs were predicted using RepeatMasker with a custom library constructed by the EDTA pipeline 36 . Tandem repeats with monomer lengths ranging between 80 and 2000 bp were identified using TRF software 37 , using parameters "2 7 7 80 10 50 2000 -f -d -m -l 15". A sliding window approach with a window size of 100 kbp and a sliding step size of 100 kbp was used to calculate the density of repeats in the genome. Using the density of repeats and the Hi-C interaction heat map, we estimated the approximate location of 16 centromeric regions, ranging from 2.  To determine the repetitive elements in the centromeric regions, clustering of repeat monomers was performed for each chromosome using CD-HIT-EST software 27 . Tandem repeat monomers showing more than 80% similarity were grouped into single-sequence clusters. The most abundant tandem repeat clusters within the centromeric regions were identified as the most abundant tandem repeat monomers in the centromeric regions (Supplementary Data 4). We observed that the "blank region" on the Hi-C interaction map contained a significantly high density of tandem repeats, particularly in the centromeric regions. This observation is consistent with findings from other T2T genome assemblies, such as the kiwifruit 38 and faba bean 39 genomes. Specifically, in chromosome b04, we identified a super-long centromeric region (~18.5 Mbp) harboring megabase-long islands of four highly abundant tandem repeats (CEN194-1, CEN194-2, CEN194-3, CEN195) (Supplementary Data 4). In contrast, we did not observe similar super-long centromeric regions or such highly abundant tandem repeats in the centromere of chromosome a04, indicating significant divergence of the homoeologous centromeric regions. To further investigate the structure of the centromeric regions, we used an approach similar to that described for the human genome using StainedGlass software 40,41 and visualization of the complex tandem repeats within the centromeric regions revealed the distinctiveness of the high-order structures the two subgenomes ( Supplementary Figs. 17-18). Methylation levels in the vicinity of centromeric regions were strikingly high, consistent with the presence of repetitive elements and previous studies of centromeric regions (Supplementary Fig. 19).
We scanned the five most abundant LTR-RT families in centromeric regions of all chromosome. We found that certain LTR-RT families with the highest coverage were found on different chromosomes. For instance, FAM7 (unknown classification) was found on 12 different chromosomes, while FAM13 was found on seven different chromosomes (Supplementary Data 5). LTR-RT families with the highest coverage belonged to Gypsy-CRM, Copia-Ale or unknown classification. Interestingly, we found that FAM1 (unknown classification) occupied approximately 35.58% of the centromeric region on chromosome b04, a significantly higher ratio compared with other chromosomes. In contrast, FAM16 (Gypsy-CRM) occupied nearly 11.47% of the centromeric region on the homoeologous chromosome a04 (Supplementary Data 5). Based on these findings, we hypothesize that the combination of the high density of CEN194/195 and the presence of FAM1 contributed to the origin of the super-long centromeric region on chromosome b04. We used the branch length of the phylogenetic tree based on the orthologous genes to partition subgenomes. The Arabidopsis thaliana genome was used to identify synteny blocks and orthologous gene pairs for each scaffold. Orthologous gene pairs were subjected to multiple sequence alignment and concatenated within each scaffold. Phylogenetic trees were then built based on the concatenated sequences. red: A subgenome; blue: B subgenome.

Supplementary Fig. 7. Confirmation of the presence of duplicated genomic blocks F and R by comparative chromosome painting using Arabidopsis thaliana BAC clones on mitotic chromosomes of A. rusticana.
Three times the experiment was repeated independently with similar results. scale bar, 10 µm. Supplementary Fig. 8. Pairwise comparison of contigs obtained using PacBio HiFi with the chromosome-level genome assembly based on ONT and HiC sequencing.
x-axis: chromosome-level genome; y-axis: PacBio HiFi contigs. The presumably lost segments are highlighted. Supplementary Fig. 10. The Hi-C interaction heatmap and the density distribution of the repeats along the chromosome a01-a04. The blue lines represent the density of the interspersed repeats within 100-kb sliding windows. The red lines represent the density of the tandem repeats within 100-kb sliding windows. The boundaries of contigs and scaffolds are indicated by green and blue lines, respectively. Source data are provided as a Source Data file. Supplementary Fig. 11. The Hi-C interaction heatmap and the density distribution of the repeats along the chromosome a05-a08. The blue lines represent the density of the interspersed repeats within 100-kb sliding windows. The red lines represent the density of the tandem repeats within 100-kb sliding windows. The boundaries of contigs and scaffolds are indicated by green and blue lines, respectively. Source data are provided as a Source Data file. Supplementary Fig. 12. The Hi-C interaction heatmap and the density distribution of the repeats along the chromosome b01-b04. The blue lines represent the density of the interspersed repeats within 100-kb sliding windows. The red lines represent the density of the tandem repeats within 100-kb sliding windows. The boundaries of contigs and scaffolds are indicated by green and blue lines, respectively. Source data are provided as a Source Data file. Supplementary Fig. 13. The Hi-C interaction heatmap and the density distribution of the repeats along the chromosome b05-b08. The blue lines represent the density of the interspersed repeats within 100-kb sliding windows. The red lines represent the density of the tandem repeats within 100-kb sliding windows. The boundaries of contigs and scaffolds are indicated by green and blue lines, respectively. Source data are provided as a Source Data file. Fragmented LTR retrotransposons (LTR-RTs) were annotated with RepeatMasker using the sequences of intact LTR retrotransposons as the library. The alignment coverage (the length of fragmented LTR retrotransposons/the length of intact LTR-RTs) obtained from the output of RepeatMasker. Fragmented LTR-RTs were classified into three classifications based on their alignment coverage. The methylation levels around different types of LTR-RT fragments were profiled using Deeptools.

Supplementary Fig. 33. The methylation levels (CpG) around different types of LTR retrotransposon fragments (coverage >50%).
Fragmented LTR retrotransposons (LTR-RTs) were annotated with RepeatMasker using the sequences of intact LTR-RTs the library. The alignment identity between intact LTR-RTs and identified fragmented LTR-RTs was obtained from the output of Repeatmasker. The fragmented LTR-RTs (alignment coverage >50%) were classified into three classifications based on their alignment identity. The methylation levels around different LTR-RT fragments were profiled using Deeptools.

Supplementary Fig. 34. The relationship between differentially methylated gene pairs and fragmented LTR retrotransposons.
a) The statistics of the genes haboring LTR-RT fragments in exon, intron or upstream region (-500 bp). b) The statistics of the A-subgenome genes haboring LTR-RT fragments in exon, intron or upstream region. c) The statistics of the B-subgenome genes haboring LTR-RT fragments in exon, intron or upstream region. d) The distribution of the LTR-RT fragments in the differentially methylated gene pairs that exhibited higher methylation level in the Asubgenome. e) The distribution of the LTR fragments in the differentially methylated gene pairs that exhibited higher methylation level in the B-subgenome. Source data are provided as a Source Data file. Supplementary Fig. 35. The statistics of the differentially methylated gene pairs between two subgenomes of horseradish.
a) The numbers of the differentially methylated genes dominant (with significantly greater methylation levels) in A and B subgenomes. We defined a gene pair as differentially methylated if at least one gene belonged to the hyper-or hypo-methylated genes, along with a ratio of methylation level greater than 2. The gene pairs were further classified into three groups: (1) one hypo-methylated gene (methylation level <1%) of a gene pair; (2) one hypermethylated gene (methylation level >90%) of a gene pair; (3) one hyper-methylated and one hypo-methylated gene of a gene pair. b) The expression heatmap of the differentially methylated gene pairs. c-d) The KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway enrichment analysis of differentially methylated genes dominant (with greater methylation levels) in A and B subgenome. A one-sided Fisher's exact test was adopted and adjustments were made for multiple comparisons with Benjamini and Hochberg method. Source data are provided as a Source Data file. Supplementary Fig. 36. The topologically associated domains (TAD) diversification between two subgeomes.
a) The schematic graph of the comparison of TAD boundaries between two subgenomes. The four genes nearby the TAD boundaries were extracted and defined as TAD-associated genes (TAD-g). Within the syntenic blocks, the TAD-g in syntelogs can reflect the diversification of TAD boundaries between two subgenomes. b) Veengraph of the TAD-g in syntelogs of each subgenome. The overlapped region indicates the TAD-g in syntelogs. Source data are provided as a Source Data file